Data

The aim of this project is to cluster houses located in California, based on the following attributes:

  1. longitude: A measure of how far west a house is;

  2. latitude: A measure of how far north a house is;

  3. housingMedianAge: Median age of a house within a block; a lower number is a newer building

  4. totalRooms: Total number of rooms within a block

  5. totalBedrooms: Total number of bedrooms within a block

  6. population: Total number of people residing within a block

  7. households: Total number of households in each block

  8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

  9. medianHouseValue: Median house value for households within a block (measured in US Dollars)

  10. oceanProximity: Location of the house relative to the ocean

In the first section, we apply Principal Component Analysis (PCA) to reduce the dimensionality of the data by identifying the main underlying components, which explain most of the variability.
Next, we use the retained components to perform both hierarchical and non-hierarchical clustering. This allows us to group houses according to the most important uncorrelated criteria.

We can begin by visualizing the data.
To get an idea of the distribution of the variables we plot the histograms.

Next, since the dataset includes spatial information regarding longitude and latitude, we also plot the observations on a map, coloring them by our categorical variable “ocean proximity”.

[Map legend: Ocean Proximity (NEAR OCEAN, <1H OCEAN, ISLAND, NEAR BAY, INLAND)]

PCA

For the purposes of PCA we exclude the categorical variable “ocean proximity” from our dataset. We will include it again when performing hierarchical clustering.

Standardizing variables

Before running PCA we need to scale the variables; otherwise the results would be biased by their different units of measurement. We can check that the scaling was successful by verifying that each variable has a mean of 0 and a standard deviation of 1.

##                    means sds
## longitude              0   1
## latitude               0   1
## housing_median_age     0   1
## total_rooms            0   1
## total_bedrooms         0   1
## population             0   1
## households             0   1
## median_income          0   1
## median_house_value     0   1
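
As an illustration of the scaling check, here is a minimal Python sketch (standard library only; it is not the code used to produce this report). The column values are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption of the sketch.

```python
import math

def standardize(column):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Sample standard deviation (divisor n - 1); an assumption of this sketch.
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

# Hypothetical toy values standing in for one column, e.g. median_income.
median_income = [8.3, 7.2, 5.6, 3.8, 2.1]
z = standardize(median_income)

# After scaling, the mean should be 0 and the standard deviation 1.
z_mean = sum(z) / len(z)
z_sd = math.sqrt(sum((v - z_mean) ** 2 for v in z) / (len(z) - 1))
print(abs(z_mean) < 1e-9, abs(z_sd - 1.0) < 1e-9)
```

Applying the same function to every numeric column gives variables that contribute on a comparable scale to the components.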

Eigenvalues and variance explained

Component matrix: interpretation

##                         Comp1      Comp2      Comp3      Comp4
## longitude           0.1510000  0.9180000  0.3230000  0.0380000
## latitude           -0.1500000 -0.9570000 -0.1660000  0.0820000
## housing_median_age -0.4280000 -0.0010000 -0.0650000 -0.8890000
## total_rooms         0.9580000 -0.0840000 -0.1110000 -0.0280000
## total_bedrooms      0.9680000 -0.1000000  0.0540000 -0.1180000
## population          0.9300000 -0.0640000  0.1020000 -0.1150000
## households          0.9710000 -0.1000000  0.0340000 -0.1380000
## median_income       0.1100000  0.2470000 -0.8740000  0.2000000
## median_house_value  0.0890000  0.2670000 -0.8770000 -0.1580000
## % of VAR explained  0.4347663  0.2135917  0.1885702  0.1011284
  • the first component correlates the most with total_rooms, total_bedrooms, population, and households.
    It can be conceptualized as block populousness: all four variables take high values when the number of people in a block is large, and the number of rooms and bedrooms is correspondingly high;
  • the second component correlates the most with longitude and latitude.
    It can be conceptualized as position: the higher component 2 is, the higher the longitude and the lower the latitude;
  • the third component correlates the most with median_income and median_house_value.
    It can be conceptualized as low wealth: the higher component 3 is, the lower the median income and the median house value;
  • the fourth component correlates the most with housing_median_age.
    It can be conceptualized as how recent the houses in a block are: the higher component 4 is, the lower the housing median age.

The sum of the squares of the values of each row of the component matrix is the respective communality.

##                     Comp1  Comp2  Comp3  Comp4 communality
## longitude           0.151  0.918  0.323  0.038    0.971298
## latitude           -0.150 -0.957 -0.166  0.082    0.972629
## housing_median_age -0.428 -0.001 -0.065 -0.889    0.977731
## total_rooms         0.958 -0.084 -0.111 -0.028    0.937925
## total_bedrooms      0.968 -0.100  0.054 -0.118    0.963864
## population          0.930 -0.064  0.102 -0.115    0.892625
## households          0.971 -0.100  0.034 -0.138    0.973041
## median_income       0.110  0.247 -0.874  0.200    0.876985
## median_house_value  0.089  0.267 -0.877 -0.158    0.873303
  • Communality is the total amount of variance an original variable shares with all the components included in the analysis
  • Most of the variance of each variable is explained by the four components that we kept in the analysis
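
The communalities in the table can be reproduced directly from the loadings. The following Python sketch (standard library only, not the report's original code) copies the rounded loadings from the component matrix and sums their squares row by row; summing squares column by column and dividing by the number of variables approximately recovers the proportion of variance explained by each component, up to the rounding of the loadings.

```python
# Loadings copied from the component matrix (columns: Comp1..Comp4).
loadings = {
    "longitude":          [0.151, 0.918, 0.323, 0.038],
    "latitude":           [-0.150, -0.957, -0.166, 0.082],
    "housing_median_age": [-0.428, -0.001, -0.065, -0.889],
    "total_rooms":        [0.958, -0.084, -0.111, -0.028],
    "total_bedrooms":     [0.968, -0.100, 0.054, -0.118],
    "population":         [0.930, -0.064, 0.102, -0.115],
    "households":         [0.971, -0.100, 0.034, -0.138],
    "median_income":      [0.110, 0.247, -0.874, 0.200],
    "median_house_value": [0.089, 0.267, -0.877, -0.158],
}

# Communality: sum of squared loadings across the retained components (one row).
communality = {name: sum(l * l for l in row) for name, row in loadings.items()}

# Share of total variance per component: sum of squared loadings down a column,
# divided by the number of standardized variables.
n_vars = len(loadings)
var_explained = [
    sum(row[j] ** 2 for row in loadings.values()) / n_vars for j in range(4)
]

for name, value in communality.items():
    print(f"{name:20s} {value:.6f}")
print([round(v, 4) for v in var_explained])
```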

princomp:

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.9781042 1.3864795 1.3027401 0.9540206 0.5413259
## Proportion of Variance 0.4347663 0.2135917 0.1885702 0.1011284 0.0325593
## Cumulative Proportion  0.4347663 0.6483580 0.8369282 0.9380566 0.9706159
##                            Comp.6      Comp.7      Comp.8      Comp.9
## Standard deviation     0.37788200 0.249770513 0.210915373 0.121621662
## Proportion of Variance 0.01586609 0.006931701 0.004942811 0.001643537
## Cumulative Proportion  0.98648195 0.993413653 0.998356463 1.000000000
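
The "Proportion of Variance" and "Cumulative Proportion" rows follow from the standard deviations alone: each component's variance is its squared standard deviation, and the proportions are these variances divided by their total. A short Python sketch (standard library only, for illustration) using the values printed above:

```python
# Standard deviations of the nine components, copied from the summary above.
sdev = [1.9781042, 1.3864795, 1.3027401, 0.9540206, 0.5413259,
        0.37788200, 0.249770513, 0.210915373, 0.121621662]

variances = [s * s for s in sdev]   # eigenvalues of the correlation matrix
total = sum(variances)              # close to 9, the number of variables
proportion = [v / total for v in variances]

# Running sum of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

print(round(cumulative[2], 4))  # share retained by the first three components
print(round(cumulative[3], 4))  # share retained by the first four components
```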

This plot shows how the observations and the variable vectors are arranged in the 4D component space: the first three components are plotted in a three-dimensional space, while the fourth is encoded as the intensity between the two extreme colors shown in the legend.
The graph can be interpreted as follows:

  • the more we go in the direction of each arrow, the more we go towards the region in which the respective variable has a large value;
  • the darker the color, the more recent the house.

From this plot we can notice the presence of three possible clusters: two are placed at the extremes of the cloud of points and the third is smaller and placed between the two others.

  • housing_median_age has the darkest arrow: this means that the fourth component correlates negatively with this variable, as noted above.
  • longitude and latitude point in opposite directions: this means that the second component correlates with the two variables with opposite signs.
  • median_house_value and median_income point in the same direction: indeed, we can recognize this direction as representative of the third component.
  • Lastly, total_rooms, total_bedrooms, population, and households point in the same direction, which indicates the direction of the first component.

K-means

We now proceed with non-hierarchical clustering, using the k-means method.

## [1] 0.8369282

The first three components retain 83.69% of the total variance, so we use these three components for the k-means clustering analysis.

Choose number of the clusters

The Calinski-Harabasz (CH) index reaches its peak when the number of clusters is 4, so we choose 4 as the number of clusters for k-means.
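
For reference, the CH index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)]. The Python sketch below (standard library only, with hypothetical toy points rather than the report's component scores) computes it for a given partition; a larger value indicates more compact and better separated clusters.

```python
def calinski_harabasz(points, labels):
    """CH index = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n = len(points)
    dim = len(points[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Group the points by cluster label.
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    k = len(groups)

    overall = mean(points)
    between = 0.0
    within = 0.0
    for rows in groups.values():
        centroid = mean(rows)
        between += len(rows) * sqdist(centroid, overall)
        within += sum(sqdist(r, centroid) for r in rows)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: two tight groups far apart along the first axis.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
ch_good = calinski_harabasz(toy, [0, 0, 1, 1])  # partition matching the groups
ch_bad = calinski_harabasz(toy, [0, 1, 0, 1])   # partition cutting across them
print(ch_good > ch_bad)
```

In practice the index is evaluated for several candidate values of k, and the k with the highest value is retained.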

4 clusters with Kmeans

## 
##    1    2    3    4 
## 7067 2970 8865 1531
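
To make the procedure concrete, here is a minimal Python sketch of Lloyd's algorithm, the standard iteration behind k-means (standard library only). It is an illustration under stated assumptions, not the implementation used for the table above: the points are hypothetical two-dimensional scores, and the initial centroids are passed in explicitly rather than drawn at random.

```python
from collections import Counter

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments did not change, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical component scores forming three small groups.
scores = [(-1.0, 0.1), (-0.9, -0.2), (1.1, 0.0), (0.9, 0.3), (0.0, 2.0), (0.2, 2.2)]
labels, centers = kmeans(scores, centroids=[scores[0], scores[2], scores[4]])
sizes = Counter(labels)  # cluster sizes, analogous to the table above
print(labels, dict(sizes))
```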

Cluster Visualization

Interpretation of Clusters

We can visualize the boxplots for the distribution of the 4 components retained in PCA across the 4 identified clusters.

Visualization of clusters

[Map legend: Clusters 1, 2, 3, 4]

Differences:

  • Cluster 1: negative values in component 2
  • Cluster 2: negative values in component 3
  • Cluster 3: positive values in component 3, mostly positive values in component 2
  • Cluster 4: positive values in component 1

Combined with the interpretation of PCA:

  • Cluster 1: Component 2 correlates the most with longitude and latitude: the higher component 2 is, the higher the longitude and the lower the latitude. The samples in cluster 1 show strongly negative values, meaning that they have lower longitude and higher latitude. As the map of California shows, these are the houses in the northwest.

  • Cluster 2: This cluster shows negative values in component 3, which correlates the most with median income and median house value: the higher component 3 is, the lower both are. Houses in cluster 2 therefore have higher median income and median house values. On the map, cluster 2 appears along the coastline, which suggests that houses there have higher values.

  • Cluster 3: In contrast to cluster 2, cluster 3 has positive values in component 3, so houses in cluster 3 have lower median income and median house values. Since most of them also have positive values in component 2, they appear mostly in the southeast of the map.

  • Cluster 4: This cluster shows high positive values in component 1, which correlates positively with total_rooms, total_bedrooms, population, and households. Samples in cluster 4 are therefore houses in highly populous blocks.

In summary:

  • Cluster 1: houses in the northwest
  • Cluster 2: houses with higher values
  • Cluster 3: houses in the southeast and with lower values
  • Cluster 4: houses with high block populousness

Hierarchical Clustering

We now implement hierarchical clustering using Gower’s distance, which makes it possible to compute the dissimilarity between observations based on both numerical and categorical variables. In addition to the three components found with PCA (populousness, position, wealth), the following analysis also considers the variable ocean proximity, which describes the distance from the ocean through five levels; in particular, it singles out San Francisco Bay (NEAR BAY).

Computing the distances

We compute Gower’s distance to measure the dissimilarity among the observations. As a technical side note, we also use the parallel package to speed up the computation by running it on 3 cores.
The output is a distance matrix with as many rows and columns as there are observations. For simplicity, we display only the first 5 rows and columns.

##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.0000000 0.2833059 0.2546131 0.2291734 0.5183503
## [2,] 0.2833059 0.0000000 0.5250509 0.2922206 0.5510373
## [3,] 0.2546131 0.5250509 0.0000000 0.4421927 0.4845992
## [4,] 0.2291734 0.2922206 0.4421927 0.0000000 0.5150133
## [5,] 0.5183503 0.5510373 0.4845992 0.5150133 0.0000000
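
As a sketch of how such a matrix is built, the Python snippet below (standard library only, not the report's code) computes Gower's distance for mixed records: a numeric column contributes the absolute difference divided by that column's range, a categorical column contributes 0 for a match and 1 for a mismatch, and the contributions are averaged. The records are hypothetical, all variables are weighted equally, and every numeric column is assumed to have a non-zero range.

```python
def gower_distance(a, b, ranges, categorical):
    """Average per-variable dissimilarity between two records of equal length."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in categorical:
            total += 0.0 if x == y else 1.0   # simple matching for categories
        else:
            total += abs(x - y) / ranges[i]   # range-normalized absolute difference
    return total / len(a)

# Hypothetical records: (component 1, component 2, component 3, ocean proximity).
records = [
    (1.0, 0.5, -1.0, "NEAR BAY"),
    (0.0, -1.5, 0.0, "INLAND"),
    (-1.0, 0.5, 1.0, "NEAR BAY"),
]
categorical = {3}  # index of the categorical column

# Range (max - min) of each numeric column over the whole data set.
ranges = [
    None if i in categorical
    else max(r[i] for r in records) - min(r[i] for r in records)
    for i in range(len(records[0]))
]

# Full symmetric distance matrix with zeros on the diagonal.
dist = [[gower_distance(r, s, ranges, categorical) for s in records] for r in records]
for row in dist:
    print([round(d, 3) for d in row])
```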

Implementing Ward’s method

Among the linkage methods, we selected Ward’s, as it provides the most balanced clusters.
We decided to cut the dendrogram at a height of 10, which yields 3 clusters. Splitting the data into 5 groups would also have been possible, but the resulting clusters were less well defined.
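
The following Python sketch (standard library only) illustrates Ward's criterion on a precomputed distance matrix through the Lance-Williams update applied to squared distances. It is a naive illustration, not the implementation used in this report: the report does not state which Ward variant was applied, the toy distances are hypothetical, and the sketch stops at a requested number of clusters instead of cutting the dendrogram at a given height.

```python
def ward_clusters(dist, n_clusters):
    """Agglomerative clustering with Ward's criterion on a distance matrix.

    Merges the closest pair of clusters repeatedly and updates squared
    dissimilarities with the Lance-Williams formula. Cubic time, for
    illustration only.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}  # active cluster id -> observation indices
    # Squared dissimilarities between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > n_clusters:
        a, b = min(d2, key=d2.get)        # closest pair of active clusters
        d_ab = d2.pop((a, b))
        na, nb = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        for c in members:                 # update distances to the merged cluster
            nc = len(members[c])
            d_ac = d2.pop((min(a, c), max(a, c)))
            d_bc = d2.pop((min(b, c), max(b, c)))
            d2[(c, next_id)] = (
                (na + nc) * d_ac + (nb + nc) * d_bc - nc * d_ab
            ) / (na + nb + nc)
        members[next_id] = merged
        next_id += 1
    # Relabel the surviving clusters as 1..n_clusters.
    labels = [0] * n
    for label, idxs in enumerate(members.values(), start=1):
        for i in idxs:
            labels[i] = label
    return labels

# Hypothetical observations on a line; distances are absolute differences.
values = [0.0, 1.0, 10.0, 11.0, 50.0]
toy_dist = [[abs(x - y) for y in values] for x in values]
ward_labels = ward_clusters(toy_dist, n_clusters=3)
print(ward_labels)
```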

Cluster sizes

## 
##    1    2    3 
## 1108 1736 2156

2D representation

We can visualize the distribution of the clusters along the first two components, the two dimensions that explain most of the variability in the data.
The clusters are indeed quite balanced and well separated.
Some houses display a particularly high value for component 1 and could be considered outliers, but within the scope of this project we decided to retain them and assign them to the closest cluster.

Variables’ distribution in the clusters

Continuous variables

In order to see how the components behave in the three clusters, we display the density plots of their distribution.

From these plots we can observe the following:

  • component 1 is concentrated around 0 for groups 2 and 3, while group 1 is more dispersed and has a higher mean value.
  • component 2 clearly distinguishes cluster 2 from clusters 1 and 3. Groups 1 and 3 have similar mean values but, as we will see on the map, the houses in group 1 vary more in latitude and longitude, as they are located all along the coast.
  • component 3 shows the largest differences among the three means, identifying three different levels of wealth.

Categorical variable

To visualize the relationship between the groups and the qualitative variable “ocean_proximity”, we can display a bar chart with values expressed as percentages. This makes it easier to compare the distribution of the categories, given the different sizes of the clusters.
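
The percentages behind such a chart are within-cluster shares: the count of each category in a cluster divided by the size of that cluster. A small Python sketch (standard library only, with hypothetical labels and categories rather than the report's data):

```python
from collections import Counter

def category_shares(labels, categories):
    """Percentage of each category within each cluster."""
    pair_counts = Counter(zip(labels, categories))
    cluster_sizes = Counter(labels)
    return {
        (lab, cat): 100.0 * count / cluster_sizes[lab]
        for (lab, cat), count in pair_counts.items()
    }

# Hypothetical cluster labels and ocean_proximity values for six houses.
toy_labels = [1, 1, 1, 1, 2, 2]
toy_proximity = ["NEAR BAY", "NEAR BAY", "<1H OCEAN", "INLAND", "INLAND", "INLAND"]
shares = category_shares(toy_labels, toy_proximity)

for (lab, cat), pct in sorted(shares.items()):
    print(lab, cat, pct)
```

Because each cluster's shares sum to 100, clusters of different sizes can be compared side by side.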

Spatial visualization

Thanks to the geographical coordinates we can plot our data on California’s map, and confirm our previous findings.

[Map legend: Clusters 1, 2, 3]

Clusters description

In order to identify the peculiarities of each group, we can compute the means of the three components and compare them according to the related concepts.

##   Group.1 component1 component2  component3
## 1       1  1.6319536  0.4899765 -1.41564165
## 2       2 -0.6951695 -1.5132952  0.02713028
## 3       3 -0.2476038  0.9933227  0.70674544
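
A table of this kind is a group-wise mean of the component scores. The Python sketch below (standard library only) shows the computation on hypothetical scores and labels; it is not the code that produced the table above.

```python
def group_means(scores, labels):
    """Mean of each component within each cluster, keyed by cluster label."""
    grouped = {}
    for row, lab in zip(scores, labels):
        grouped.setdefault(lab, []).append(row)
    return {
        lab: [sum(col) / len(rows) for col in zip(*rows)]
        for lab, rows in sorted(grouped.items())
    }

# Hypothetical scores on three components and their cluster labels.
toy_scores = [(1.5, 0.4, -1.2), (1.7, 0.6, -1.6), (-0.7, -1.5, 0.0), (-0.3, 1.0, 0.8)]
toy_groups = [1, 1, 2, 3]
means = group_means(toy_scores, toy_groups)

for lab, row in means.items():
    print(lab, [round(v, 3) for v in row])
```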

Group 1:

  • It includes the houses in the most populous blocks.
  • The houses are spread out along the South and Central coast.
  • The inhabitants are the most wealthy.
  • In particular, this group is the most concentrated around San Francisco’s Bay.

Group 2:

  • This group has the lowest populousness per block.
  • The houses in this cluster have the lowest longitude values; in fact, they are all located in the North-West of the state.
  • The median household income and the house values in this group lie between those of the other two clusters.

Group 3:

  • The blocks in this group are not very densely populated, but their populousness is still higher than in group 2.
  • Its houses are located in the South-East of California and are distributed quite close to the ocean.
  • The inhabitants are those with the lowest income.